305 research outputs found
A critical analysis of self-supervision, or what we can learn from a single image
We look critically at popular self-supervision techniques for learning deep
convolutional neural networks without manual labels. We show that three
different and representative methods, BiGAN, RotNet and DeepCluster, can learn
the first few layers of a convolutional network from a single image as well as
using millions of images and manual labels, provided that strong data
augmentation is used. However, for deeper layers the gap with manual
supervision cannot be closed even if millions of unlabelled images are used for
training. We conclude that: (1) the weights of the early layers of deep
networks contain limited information about the statistics of natural images,
that (2) such low-level statistics can be learned through self-supervision just
as well as through strong supervision, and that (3) the low-level statistics
can be captured via synthetic transformations instead of using a large image
dataset.Comment: Accepted paper at the International Conference on Learning
Representations (ICLR) 202
Learning without Prejudice: Avoiding Bias in Webly-Supervised Action Recognition
Webly-supervised learning has recently emerged as an alternative paradigm to
traditional supervised learning based on large-scale datasets with manual
annotations. The key idea is that models such as CNNs can be learned from the
noisy visual data available on the web. In this work we aim to exploit web data
for video understanding tasks such as action recognition and detection. One of
the main problems in webly-supervised learning is cleaning the noisy labeled
data from the web. The state-of-the-art paradigm relies on training a first
classifier on noisy data that is then used to clean the remaining dataset. Our
key insight is that this procedure biases the second classifier towards samples
that the first one understands. Here we train two independent CNNs, a RGB
network on web images and video frames and a second network using temporal
information from optical flow. We show that training the networks independently
is vastly superior to selecting the frames for the flow classifier by using our
RGB network. Moreover, we show benefits in enriching the training set with
different data sources from heterogeneous public web databases. We demonstrate
that our framework outperforms all other webly-supervised methods on two public
benchmarks, UCF-101 and Thumos'14.Comment: Submitted to CVIU SI: Computer Vision and the We
Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data
We present Viewset Diffusion, a diffusion-based generator that outputs 3D
objects while only using multi-view 2D data for supervision. We note that there
exists a one-to-one mapping between viewsets, i.e., collections of several 2D
views of an object, and 3D models. Hence, we train a diffusion model to
generate viewsets, but design the neural network generator to reconstruct
internally corresponding 3D models, thus generating those too. We fit a
diffusion model to a large number of viewsets for a given category of objects.
The resulting generator can be conditioned on zero, one or more input views.
Conditioned on a single view, it performs 3D reconstruction accounting for the
ambiguity of the task and allowing to sample multiple solutions compatible with
the input. The model performs reconstruction efficiently, in a feed-forward
manner, and is trained using only rendering losses using as few as three views
per viewset. Project page: szymanowiczs.github.io/viewset-diffusion.Comment: International Conference on Computer Vision 202
What does CLIP know about a red circle? Visual prompt engineering for VLMs
Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot
classification to text-to-image generation. Despite that, their capabilities
for solving novel discriminative tasks via prompting fall behind those of large
language models, such as GPT-3. Here we explore the idea of visual prompt
engineering for solving computer vision tasks beyond classification by editing
in image space instead of text. In particular, we discover an emergent ability
of CLIP, where, by simply drawing a red circle around an object, we can direct
the model's attention to that region, while also maintaining global
information. We show the power of this simple approach by achieving
state-of-the-art in zero-shot referring expressions comprehension and strong
performance in keypoint localization tasks. Finally, we draw attention to some
potential ethical concerns of large language-vision models
- …